ABSTRACT


The Armed Services Vocational Aptitude Battery (ASVAB) is a test that 
approximately 700,000 students in 12,000 high schools take each year to determine 
military occupation placement. Form Assembly for the ASVAB refers to the selection 
of 20-35 questions, known as items, from an item pool of approximately 300 items to 
create a paper and pencil test in one of its ten topics. Previous research formulates form 
assembly as an Integer Linear Program (ILP). The current ASVAB mostly uses a 
Computer Adaptive Test (CAT), which estimates an examinee’s ability after the 
examinee answers each item and selects the next item based on prior performance. The 
current CAT-ASVAB implementation does not control the number of items selected from 
each subject (taxonomy group) for a test. This thesis introduces ILPs, previously used for 
form assembly, that impose taxonomy restrictions and applies them to the CAT-ASVAB. 
We create four ILP variations and test them against the current method of item selection, 
by simulating 3,500 examinees (500 examinees each for seven given ability levels). The 
results show that all of the ILPs have acceptable solution times for CAT use, and 
taxonomy restrictions can be imposed while also having more even exposure rates (the 
number of times an item is administered divided by the number of examinees) than the 
current implementation of the CAT-ASVAB. A variation that relaxes most of the binary
variables and constrains the difficulty of each item to be within a predetermined
magnitude of the current ability estimate performs the best in terms of item exposure (for
both under- and over-utilized items) and error between an examinee’s estimated ability
level and actual ability level.
EXECUTIVE SUMMARY


The Armed Services Vocational Aptitude Battery (ASVAB) is a test that 
approximately 700,000 students in 12,000 high schools take each year to determine 
military occupation placement. Form Assembly for the ASVAB refers to the selection 
of 20-35 questions, known as items, from an item pool of approximately 300 items to 
create a paper and pencil test in one of its ten topics. ASVAB form assembly has been
previously formulated as an integer linear program (ILP) with an objective function that
minimizes the deviation from a predetermined goal curve for the test.


Most of the ASVAB tests are administered as a Computer Adaptive Test (CAT). 
The CAT estimates an examinee’s ability after the examinee answers each item and 
selects the next item based on prior performance. Because the CAT is able to determine 
an examinee’s ability level after each question and select future questions based on this 
estimator, the test length for a CAT is shorter than a paper and pencil test. However, the 
current CAT-ASVAB does not control the number of items selected from each subject 
(taxonomy group) for a test. Therefore, the taxonomy distribution of the items in a test
can be heavily skewed toward a particular subject. A solution to this problem is for a test
to not only select the next item, but select an entire test trajectory for the examinee’s
current estimated ability. This is called a shadow test, and this thesis combines a
shadow test with previously researched paper and pencil form assembly for application to
the CAT-ASVAB.


This thesis also discusses other problems associated with the CAT, such as item 
exposure control and solution time. One method it explores is item-stratification. In this 
method, the item selection algorithm divides the item pool into groups according to their 
discrimination parameter (an item with a high discrimination parameter is able to separate 
examinees with nearly the same ability, whereas a low discrimination parameter does not
separate them as well) and divides the test into an equal number of stages. The purpose is
to select items with a lower discrimination (and therefore lower information value)
toward the beginning of a test, and leave items with a higher discrimination (and higher
information value) until the end when the ability estimate is more accurate.


There are five variations of CAT-ASVAB item selection considered in this thesis:
1) a previously researched paper and pencil form assembly method for the ASVAB
(KM); 2) KM that constrains the difficulty parameter (a parameter that measures the
difficulty of an item) to be within a certain amount of the current ability level of the
examinee (DM); 3) KM with the addition of item-stratification constraints (SM); 4)
KM that has both difficulty parameter constraints and item-stratification constraints
(SDM); and 5) the current item selection method of the CAT-ASVAB (OM), which serves as a
benchmark for the other four. Each of the five variations of the model is
examined using 3,500 artificially generated examinees (500 examinees each for seven 
given ability levels). Aside from SM and SDM having a high maximum exposure rate, 
our results indicate that all of the shadow test variations have more even exposure rates 
than the current implementation of the CAT-ASVAB, with significantly fewer unutilized
items. DM performs the best in terms of item exposure (for both under- and over-utilized
items) and error between an examinee’s estimated ability level and actual ability level. 
All of the variations benefit from the ability to add taxonomy constraints. Without the 
taxonomy constraints, our results suggest that the current CAT implementation has a
taxonomy distribution heavily favoring one of the taxonomy groups.


I. INTRODUCTION 


Since 1968, all US military applicants have taken the Armed Services Vocational
Aptitude Battery (ASVAB) to determine military occupation placement. Approximately
700,000 students in 12,000 high schools take this test every year [Pommerich 2005].
Form assembly for the ASVAB refers to the selection of multiple choice questions, 
known as items, out of a given item pool to create a paper and pencil test in one of its ten 
topics. A typical form has 20-35 items selected from an item pool of approximately 300 
items. Kunde [1997] formulates form assembly as an integer linear program (ILP) and
solves it both optimally and using heuristics.


In 1997, many ASVAB tests were still commonly administered in their printed 
(paper and pencil) form. The ASVAB has since moved toward being a Computer 
Adaptive Test (CAT) [e.g., Weiss 2004]. Other tests that use a CAT include the GRE 
[e.g., Syvum 2006] and GMAT [e.g., Princeton Review 2006]. The CAT estimates an 
examinee’s ability after the examinee answers each item and selects the next item based 
on this estimator. This allows it to use fewer items than a paper and pencil exam to
determine an examinee’s ability.


The current CAT-ASVAB item selection algorithm does not take into
account item taxonomy constraints [Sands, Waters, and McBride 1999]. A taxonomy
constraint imposes a limit on the number of items from a given subject (e.g., Addition,
Division, etc.). Veldkamp and van der Linden [2004] use a shadow test to determine the
next question. A shadow test creates a whole test trajectory for the examinee’s current
estimated ability, then chooses the best item from that trajectory to administer. By
creating this whole test, other constraints can be added to the formulation, including
taxonomy constraints.


This thesis extends the ILP from Kunde [1997] for use as a shadow test and 
applies it to item selection for a CAT-ASVAB. The primary extensions speed solution 
time and control item exposure. Item exposure control refers to limiting the number of
times a test administers an item to a set of examinees. Too many examinees receiving the
same item increases the likelihood of a future examinee having advance knowledge of
an item.


II. BACKGROUND


A. TEST THEORY 
The ASVAB uses Item Response Theory (IRT) to measure the precision of each
test. An examinee’s ability level is denoted as θ. It is assumed that θ follows a standard
normal distribution (mean of zero and a standard deviation of one). The range of θ is
commonly set between -3.0 and 3.0 or between -2.5 and 2.5 [Sands, Waters, and McBride 1999]. In
IRT, the probability of an examinee with ability level θ answering an item correctly is
calculated with the three-parameter logistic function shown below [Lord 1980]:


$$p(\theta) = c + \frac{1 - c}{1 + e^{-Da(\theta - b)}}$$








[Figure: logistic curve of the probability of a correct answer, p(θ), plotted against ability θ, with the inflection point marked]

Figure 1: Sample Logistic Function
In the above sample, the discrimination parameter: a=2.24, the
difficulty parameter: b=0.72, and the guessing parameter: c=0.4

The three parameters are a, b, and c, with D being a scaling factor. The a parameter
is the discrimination of the item. This is the capability of the item to distinguish between
applicants of different abilities. In Figure 1, the a parameter is proportional to the slope
of the logistic function at its inflection point. The steeper the slope, the greater the
difference in the probability of a correct response between examinees with different
ability levels; a flatter slope means examinees with different ability levels have more
similar probabilities of a correct response. The b parameter measures the difficulty of an
item. In Figure 1, the b parameter determines the position of the curve’s inflection point
along the θ-axis. Finally, parameter c is the guessing parameter. This is the probability of
a person with a low ability level guessing the item correctly. This parameter shows up in
Figure 1 as the lower asymptotic bound on the p(θ) axis. These parameters are typically
calculated after the item has been pretested 1,000 to 10,000 times. From here, the item
information function can be derived from p(θ) [Lord 1980]:
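$$I_i(\theta) = \frac{[p'(\theta)]^2}{p(\theta)\,(1 - p(\theta))}$$

$$I_i(\theta) = \frac{D^2 a^2 (1 - c)}{\left(c + e^{Da(\theta - b)}\right)\left(1 + e^{-Da(\theta - b)}\right)^2}$$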

where p' is the derivative of p. The presence of the derivative in the numerator 


indicates that items with a higher discrimination parameter have a higher information 
value. Because the information contribution of each item is assumed to be independent 
of the other items in the ASVAB, the item information functions can be added together to 
produce an overall information curve. With N being the number of items in the form, the 


exam information function is [Lord 1980]: 


$$I(\theta) = \sum_{i=1}^{N} I_i(\theta).$$


This function measures the precision of the exam in estimating an examinee’s true 
ability level. The next section shows how the above information function is applicable to 


form assembly. 


B. OPTIMIZATION OF FORM ASSEMBLY FOR ASVAB (PAPER AND 
PENCIL) 


This section describes Kunde’s paper and pencil formulation, which is used in the 
optimization model in this thesis for the CAT-ASVAB. Kunde’s formulation has two 


goals expressed in the objective function. The first is to minimize the difference between 


the information of the exam and the information from a goal curve. A goal curve is a test 
information function like the one introduced in the previous section that represents the 
desired information distribution of the exam across the ability levels. It is produced from 
empirical research and testing. The deviations between an assembled form and the goal 
curve for specific values of θ are organized by their magnitude into groups, which are
denoted in the formulation below by the index g. Each group is assigned a penalty per 
unit of deviation. Higher deviations from the goal curve receive a higher penalty per unit 


deviation. 


For security purposes, alternate forms are created for an exam (denoted by the 
index f). This leads to the second goal of the formulation: to make each form as similar 
as possible in information. The second component of the objective function seeks to 


minimize the deviations of each form from the first reference form. 


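Below is Kunde’s integer linear program formulation for the paper and pencil
form.

Indices:
$i$    item from the item pool;
$\theta$    ability level;
$f$    form to be assembled (1, 2, ..., F);
$t$    taxonomy (1, 2, ..., T);
$g$    penalty group
Sets:
$TaxItems_t$    The set of items in taxonomy group $t$
Data:
$CAT_g$    The maximum deviation between a form and the goal curve in group $g$
$INF_{i\theta}$    Information value of item $i$ at percentile $\theta$
$NITEM_t$    The required number of items in taxonomy $t$
$PARAWEI$    Weight that combines the two goals
$PENALTY_g$    Penalty per unit deviation within group $g$
$SHAPE_\theta$    The information value for the goal curve at percentile $\theta$
Variables:
$x_{if}$    One, if item $i$ is used in form $f$
$py_{g\theta f}$    Deviation above the goal curve in group $g$ at percentile $\theta$ on form $f$
$ny_{g\theta f}$    Deviation below the goal curve in group $g$ at percentile $\theta$ on form $f$
$delplus_f$    The total information form one contains that exceeds form $f$
$delneg_f$    The total information form $f$ contains that exceeds form one

Formulation:

\begin{align}
\min\ & \sum_{\theta}\sum_{f}\sum_{g} \mathit{PENALTY}_g\,(py_{g\theta f} + ny_{g\theta f}) + \mathit{PARAWEI}\sum_{f>1}(\mathit{delplus}_f + \mathit{delneg}_f) \tag{k1} \\
\text{such that}\ & \sum_{g} py_{g\theta f} \ge \sum_{i} \mathit{INF}_{i\theta}\,x_{if} - \mathit{SHAPE}_\theta && \forall \theta, f \tag{k2} \\
& \sum_{g} ny_{g\theta f} \ge -\sum_{i} \mathit{INF}_{i\theta}\,x_{if} + \mathit{SHAPE}_\theta && \forall \theta, f \tag{k3} \\
& \sum_{i \in \mathit{TaxItems}_t} x_{if} = \mathit{NITEM}_t && \forall f, t \tag{k4} \\
& \sum_{f} x_{if} \le 1 && \forall i \tag{k5} \\
& \sum_{i}\sum_{\theta} \mathit{INF}_{i\theta}\,x_{i1} - \sum_{i}\sum_{\theta} \mathit{INF}_{i\theta}\,x_{if} = \mathit{delplus}_f - \mathit{delneg}_f && \forall f > 1 \tag{k6} \\
& 0 \le py_{g\theta f} \le \mathit{CAT}_g && \forall \theta, f, g \tag{k7} \\
& 0 \le ny_{g\theta f} \le \mathit{CAT}_g && \forall \theta, f, g \tag{k8} \\
& x_{if}\ \text{binary} && \forall i, f \tag{k9} \\
& \mathit{delplus}_f,\ \mathit{delneg}_f \ge 0 && \forall f \tag{k10}
\end{align}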


The first component in the objective function (k1), corresponding to the first goal
of minimizing deviation from the goal curve,
$\sum_{\theta}\sum_{f}\sum_{g} PENALTY_g\,(py_{g\theta f} + ny_{g\theta f})$,
expresses the vertical deviation from the goal curve. The variables $py_{g\theta f}$ and
$ny_{g\theta f}$ are the positive and negative deviations, respectively, of form $f$ from the
goal curve, in group $g$, for ability $\theta$. In the second component of the objective function,
$PARAWEI \sum_{f>1} (delplus_f + delneg_f)$, the variable $delplus_f$ is the total form one
information in excess of form $f$, while $delneg_f$ is the total form $f$ information in excess of
form one.


Constraint sets (k2) and (k3) give the values for the positive and negative
deviations of the information function from the goal curve. Set (k4) specifies the number
of items in a form from a given taxonomy. Set (k5) states that item $i$ can appear in
at most one form. Set (k6) gives the total difference in information between the forms,
and sets (k7) and (k8) bound the deviations of the information function from the goal
curve.
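To make the grouped penalty in (k1), (k2), (k3), (k7), and (k8) concrete, the sketch below
evaluates the first objective component for a form that has already been chosen. It is not
Kunde’s solution method: it assumes that the cheapest penalty groups are filled first (which
is what the minimization does when the penalties increase with the size of the deviation),
and the function names and numbers are hypothetical.

```python
def deviation_penalty(deviation, caps, penalties):
    """Cheapest cost of covering |deviation| with group variables.

    caps[g] plays the role of CAT_g and penalties[g] the role of PENALTY_g.
    """
    remaining = abs(deviation)
    cost = 0.0
    # Fill the groups in order of increasing penalty per unit of deviation.
    for cap, penalty in sorted(zip(caps, penalties), key=lambda cp: cp[1]):
        used = min(remaining, cap)
        cost += penalty * used
        remaining -= used
    if remaining > 1e-12:
        raise ValueError("deviation exceeds the total group capacity")
    return cost


def form_cost(selected, inf, shape, caps, penalties):
    """Goal-curve penalty of one form.

    inf[i][t] is the information of item i at ability grid point t and
    shape[t] is the goal-curve value at that grid point.
    """
    total = 0.0
    for t, goal in enumerate(shape):
        form_info = sum(inf[i][t] for i in selected)
        total += deviation_penalty(form_info - goal, caps, penalties)
    return total


if __name__ == "__main__":
    inf = {0: [1.0, 0.5], 1: [0.5, 1.0]}  # hypothetical two-item pool
    print(form_cost([0, 1], inf, shape=[2.0, 1.0],
                    caps=[1.0, 2.0], penalties=[1.0, 3.0]))
```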


C. CAT-ASVAB 

The formulation above optimizes the objective function across all values of θ, and creates
a form that satisfies a set of specified attributes (e.g., length and taxonomy). In a CAT,
the examinee’s current performance on the exam determines each item that is
administered. Therefore, at a given point in an exam, an individual with a higher estimated
ability receives a more difficult item than an individual with a lower estimated ability.
Because the examinee receives an item based on his estimated ability, the exam can
produce a better estimate for the examinee’s ability in fewer questions. As currently
implemented, all examinees start with the same average ability level estimate, $\theta_0 = 0$.
The CAT-ASVAB uses the Owens Bayes algorithm to calculate the ability after each
item is answered. Because the order of items administered affects the ability calculation,
an additional Bayesian modal calculation is used to calculate θ at the end of the test.
Currently, the item selection algorithm for the CAT-ASVAB seeks to maximize the item
information function at the examinee’s current θ and limit item exposure. The
information values are pulled from a table by θ [Sands, Waters, and McBride 1999].


1. Shadow Test 

One method proposed to deal with the taxonomy constraints is a shadow test [e.g.,
van der Linden and Veldkamp 1998]. Instead of merely calculating the best item to
administer at the current θ, a whole test trajectory is constructed for the examinee at the
current θ. The indices used in the formulation below are the same as in Kunde’s
formulation with the addition of an index h, a quantitative attribute group. An example
of a quantitative attribute group is the total word count for all items in the group adding
up to a pre-specified total. Thus, a possible constraint would be to limit the total word
count for a set of items in each group h. This is represented by the following constraints:


$$\sum_{i \in Q_h} L_{ih}\,x_i \le UH_h \qquad \forall h$$

$$\sum_{i \in Q_h} L_{ih}\,x_i \ge LH_h \qquad \forall h,$$


where $L_{ih}$, in this example, is the word count for item $i$, $UH_h$ and $LH_h$ are an upper and
lower bound, respectively, on the sum of the word counts for all items in group $h$, and $Q_h$
is the set of items in group $h$. Below is Veldkamp and van der Linden’s formulation
using notation consistent with Kunde’s formulation above.


Indices:
$k$    iteration count where examinee is given his $k$th question
$h$    quantitative attribute group
Sets:
$Fix$    The set of items already administered
$Q_h$    The set of items in quantitative attribute group $h$
Data:
$\theta_{k-1}$    Current ability estimate after $k-1$ items have been administered
$L_{ih}$    Quantitative attribute for item $i$ for attribute group $h$
$UH_h$    Upper bound for number of items in group $h$
$LH_h$    Lower bound for number of items in group $h$
$UT_t$    Upper bound for number of items in taxonomy $t$
$LT_t$    Lower bound for number of items in taxonomy $t$
$I_i(\theta)$    The item information value at $\theta$
Decision Variable:
$x_i$    One, if item $i$ is used in the shadow test
Formulation:

\begin{align}
\max\ & \sum_{i} I_i(\theta_{k-1})\,x_i \tag{v1} \\
\text{such that}\ & x_i = 1 && \forall i \in \mathit{Fix} \tag{v2} \\
& \sum_{i \in \mathit{TaxItems}_t} x_i \le UT_t && \forall t \tag{v3} \\
& \sum_{i \in \mathit{TaxItems}_t} x_i \ge LT_t && \forall t \tag{v4} \\
& \sum_{i \in Q_h} L_{ih}\,x_i \le UH_h && \forall h \tag{v5} \\
& \sum_{i \in Q_h} L_{ih}\,x_i \ge LH_h && \forall h \tag{v6} \\
& \sum_{i} x_i = N \tag{v7} \\
& x_i\ \text{binary} && \forall i \tag{v8}
\end{align}


The model selects the item with the greatest information at the current ability,
$\theta_{k-1}$, from the items in the shadow test that have not already been administered.
Constraint set (v2) sets $x_i$ to 1 for the items $i$ that have already been administered.
Constraint sets (v3) and (v4) are taxonomy constraints and set an upper and lower limit,
respectively, on the number of items administered from each taxonomy group. Constraint
sets (v5) and (v6) are the above-mentioned quantitative attribute constraints, and
constraint (v7) fixes the test length at N items. “Because
each shadow test meets the constraints, the adaptive test automatically meets them” [van
der Linden and Veldkamp 2004].
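The shadow test step can be illustrated on a toy pool small enough to enumerate. The sketch
below is only an illustration, not the formulation’s solution method: it replaces the ILP
solver with brute-force enumeration, omits the quantitative attribute constraints (v5) and
(v6), and uses hypothetical names and data.

```python
from itertools import combinations


def shadow_test(info, taxonomy, administered, n_items, lower, upper):
    """Return (best_test, next_item) for one adaptive step.

    info[i]      item information at the current ability estimate
    taxonomy[i]  taxonomy group of item i
    administered items already given (the set Fix)
    lower, upper per-taxonomy bounds (LT_t and UT_t)
    """
    administered = set(administered)
    free = [i for i in info if i not in administered]
    best, best_value = None, float("-inf")
    for extra in combinations(free, n_items - len(administered)):
        test = administered | set(extra)
        counts = {}
        for i in test:
            counts[taxonomy[i]] = counts.get(taxonomy[i], 0) + 1
        # Skip candidate tests that violate a taxonomy bound.
        if any(not (lower[t] <= counts.get(t, 0) <= upper[t]) for t in lower):
            continue
        value = sum(info[i] for i in test)
        if value > best_value:
            best, best_value = test, value
    if best is None:
        raise ValueError("no feasible shadow test")
    # Administer the most informative item not yet given.
    next_item = max(best - administered, key=lambda i: info[i])
    return best, next_item


if __name__ == "__main__":
    info = {1: 0.9, 2: 0.8, 3: 0.7, 4: 0.3, 5: 0.2, 6: 0.1}
    taxonomy = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
    bounds = {"A": 2, "B": 2}
    print(shadow_test(info, taxonomy, {1}, 4, bounds, bounds))
```

In this toy pool a pure maximum-information rule would take three items from group A; the
taxonomy bounds force two items from each group, which is the behavior the shadow test is
meant to guarantee.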


2. Taxonomy and Item Exposure Control Research for CAT

Much research has been done on different ways to implement CAT. Because one 
of the main concerns with CAT is item exposure control, many papers written about CAT 
implementation discuss possible solutions for this issue. The CAT-ASVAB currently 
uses Sympson and Hetter’s [1985] algorithm to control item exposure. This thesis uses 
this algorithm for its optimization model as well. The Sympson and Hetter algorithm 
assigns a number between zero and one, called the item exposure parameter, to each item. 
A pretest simulation determines these parameters. Items with a higher exposure rate at 
the end of the simulation receive a lower exposure parameter. During the actual test, 
when the test selects an item, it generates a random number uniformly distributed 
between zero and one. If the item exposure parameter of this item is less than the random 
number, the test rejects the item and selects the item with the next highest information 


value, and so on. 
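The acceptance step described above can be sketched as follows. This is a minimal sketch of
the selection rule only; the pretest simulation that produces the exposure parameters is not
shown, and the function and argument names are hypothetical.

```python
import random


def sympson_hetter_select(candidates, info, exposure_param, rng=random):
    """Return the first accepted item, scanning from most to least informative.

    An item is rejected when its exposure parameter is less than a fresh
    uniform(0, 1) draw; None is returned if every candidate is rejected.
    """
    for item in sorted(candidates, key=lambda i: info[i], reverse=True):
        if exposure_param[item] >= rng.random():
            return item
    return None


if __name__ == "__main__":
    info = {"q1": 1.0, "q2": 0.6, "q3": 0.4}
    exposure = {"q1": 0.3, "q2": 0.8, "q3": 1.0}  # hypothetical parameters
    print(sympson_hetter_select(info.keys(), info, exposure, random.Random(7)))
```

An item with an exposure parameter of one is never rejected, so a candidate list that ends
with such an item always yields a selection.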


Another technique to control item exposure is called 5-4-3-2-1 [Sympson and 
Hetter 1997]. The first item is chosen randomly out of the five most informative items. 
The next item is then chosen randomly out of the four most informative, and so on until it 
is choosing from one item. Afterwards, the procedure starts over again at five items. 
Another randomization technique is to choose one item out of three, then disqualify the 
other two from further administration [Thomasson 1998]. Another technique does not 
use the item information value, but randomly selects from items within a specified 


distance from a target difficulty level [Lunz and Stahl 1998]. 
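The 5-4-3-2-1 rule is simple enough to state in a few lines. The sketch below follows the
description above, with the function name and the zero-based position argument chosen
here for illustration.

```python
import random


def select_54321(position, candidates, info, rng=random):
    """Pick at random among the most informative candidates.

    position is the number of items already administered; the group size
    cycles through 5, 4, 3, 2, 1 and then starts over at 5.
    """
    group_size = 5 - (position % 5)
    ranked = sorted(candidates, key=lambda i: info[i], reverse=True)
    return rng.choice(ranked[:group_size])


if __name__ == "__main__":
    info = {i: 1.0 - 0.1 * i for i in range(8)}  # hypothetical information values
    rng = random.Random(0)
    print([select_54321(pos, list(info), info, rng) for pos in range(6)])
```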


Other methods require a more significant change in item or test structure to
address item exposure control. One method is item stratification, and this thesis also
incorporates this method into its optimization model. Items fall into n groups called strata by
their a parameters, and exams divide into n stages. For a model with taxonomy
constraints, the method first categorizes the items by their taxonomy before sorting the items
within each taxonomy by the a parameter. It then divides the items in each taxonomy
into n groups. Items from the first group in each taxonomy go into the first stratum, items
from the second group go into the second stratum, and so on until there are n strata. During
each stage, the test selects an item from the stratum with the same number [Leung, Chang,
and Hau 2003].
Item stratification selects items with a lower discrimination value near the beginning of
the test. Because items with a higher discrimination also carry higher information values,
item stratification is contrary to the typical approach of selecting the item with the highest
information value. Item stratification reserves the items that carry more information
for the end of the exam, where the ability estimate is closer to the true ability. In a
study done by Chang and van der Linden, item stratification yields more even exposure
rates throughout the items, thus having fewer underexposed and overexposed items.
Below is the formulation of the item stratification model as a shadow test. The indices
are the same as the shadow test formulation given in the previous section, with the
addition of the index r, the stratum [Chang and van der Linden 2003].


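Indices:
$r$    stratum;
Sets:
$Q_r$    The set of items in stratum $r$ when selecting item $k$
Data:
$S_r$    The required number of items from stratum $r$
$B_i$    Difficulty of item $i$ (standard deviations from $\theta = 0$)
Variables:
$y$    Deviation of item’s difficulty parameter from $\theta_{k-1}$
Formulation:

\begin{align}
\min\ & y \tag{c1} \\
\text{such that}\ & (B_i - \theta_{k-1})\,x_i \le y && \forall i \in Q_r \tag{c2} \\
& (B_i - \theta_{k-1})\,x_i \ge -y && \forall i \in Q_r \tag{c3} \\
& x_i = 1 && \forall i \in \mathit{Fix} \tag{c4} \\
& \sum_{i \in Q_r} x_i = S_r && \forall r \tag{c5} \\
& \sum_{i \in \mathit{TaxItems}_t} x_i \le UT_t && \forall t \tag{c6} \\
& \sum_{i \in \mathit{TaxItems}_t} x_i \ge LT_t && \forall t \tag{c7} \\
& \sum_{i \in Q_h} L_{ih}\,x_i \le UH_h && \forall h \tag{c8} \\
& \sum_{i \in Q_h} L_{ih}\,x_i \ge LH_h && \forall h \tag{c9} \\
& y \ge 0 \tag{c10} \\
& x_i\ \text{binary} && \forall i \tag{c11}
\end{align}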


Items with a difficulty parameter closest to the current estimate of ability, $\theta_{k-1}$,
are chosen within the given constraints. Constraint set (c5) specifies the number of items
that must come from each stratum. The rest of the constraints are the same as the shadow
test.
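The construction of the strata described earlier (categorize by taxonomy, sort by the a
parameter, split each taxonomy into n groups) can be sketched as below. The near-equal
split rule and all names are assumptions made for illustration; the source does not say how
group sizes are chosen when a taxonomy does not divide evenly.

```python
def stratify(items, n_strata):
    """Build strata from a pool.

    items maps item_id -> (taxonomy, a). Returns a list of n_strata lists;
    the first list holds the least discriminating items of every taxonomy.
    """
    by_taxonomy = {}
    for item_id, (tax, a) in items.items():
        by_taxonomy.setdefault(tax, []).append((a, item_id))
    strata = [[] for _ in range(n_strata)]
    for tax in sorted(by_taxonomy):
        ranked = [item_id for _, item_id in sorted(by_taxonomy[tax])]
        for pos, item_id in enumerate(ranked):
            # Near-equal split of this taxonomy's ranked list (assumed rule).
            strata[pos * n_strata // len(ranked)].append(item_id)
    return strata


if __name__ == "__main__":
    pool = {  # hypothetical pool
        "m1": ("math", 0.5), "m2": ("math", 1.5),
        "m3": ("math", 1.0), "m4": ("math", 2.0),
        "v1": ("verbal", 0.8), "v2": ("verbal", 1.2),
    }
    print(stratify(pool, 2))
```

Because every taxonomy contributes to every stratum, a stage that draws only from its own
stratum can still satisfy the taxonomy constraints.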


Another method, the Computerized Adaptive Sequential Test (CAST), partitions
the test into a collection of subtests such that these subtests become the units of test
administration instead of items [Davis and Dodd 2003]. This method groups the items
into subtests called modules and places them in multistage panels. There are two ways
to construct the panels. The first is bottom-up construction, which assembles the items into
modules such that each module, as a self-contained unit, meets the requisite information,
content, and item feature targets selected for the test [Davis and Dodd 2003]. The second
method of panel construction is top-down, where any module path through the panel
results in a test of appropriate precision, content, and item type [Davis and Dodd 2003].
The method used in Davis and Dodd’s study is the bottom-up construction. With the
exception of the first stage, the test segregates the modules by difficulty level in each
stage. The first stage has only one module. A typical allocation for the other stages
would place three modules in each of the second and third stages, with the modules
corresponding to low, medium, and high difficulty. A panel is randomly assigned to an
examinee at the beginning. From there, at the first stage, the examinee receives a module.
When the examinee completes the module, the test calculates his ability, and in the next
stage, it bases the next module the examinee receives on his current estimated ability. An
examinee can move at most one difficulty level between stages. For example, one cannot
receive an easy module after completing a hard module the stage before. Like a-stratification,
this method also yielded more even exposure rates [Davis and Dodd 2003].


Two of the methods mentioned above for item exposure control, the Sympson and
Hetter algorithm and item stratification, are incorporated into the optimization model for
this thesis, as are alternate forms from the paper and pencil exam. Shadow tests in the
existing research use either maximum information or minimum difficulty deviation
as the objective function. The formulation in the following section, however, uses the
deviation from a goal curve as in Kunde’s paper and pencil formulation for the objective
function.
III. THE CAT-ASVAB OPTIMIZATION MODELS


A. SHADOW TEST FORMULATION AND VARIATIONS 

The integer linear program (ILP) in this thesis uses Kunde’s formulation as a 
starting point and adapts it for use in the CAT-ASVAB as a shadow test. In his paper and 
pencil formulation, Kunde uses alternative forms as a means of test security. This 
shadow test formulation retains the alternative forms as a means of item exposure control 
along with the Sympson-Hetter method. For this thesis, the test creates two forms, with 
15 items each, for each shadow test. An examinee starts off on one of the forms. Each 
item selected first goes through the Sympson-Hetter algorithm. If the algorithm rejects 
an item, the test administers the item with the most information from the alternative form. 
The test does not use the rejected items again for the remainder of the exam. If the 
Sympson-Hetter algorithm also rejects the item from the alternative form, the test goes 
back and selects the next most informative item from the first form, and so on. If the 
items in the shadow tests to choose from run out, the test reruns the model to obtain a 


new shadow test. 
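The item selection procedure across the two forms can be sketched as follows. This is one
reading of the paragraph above rather than the thesis implementation: the two forms are
interleaved by information value, each candidate faces the Sympson-Hetter draw, rejected
items are remembered for the rest of the exam, and a return value of None signals that the
model should be rerun. All names are hypothetical.

```python
import random
from itertools import zip_longest


def select_with_alternate_form(primary, alternate, info, exposure_param,
                               rejected, rng=random):
    """Return the administered item, or None when both forms are exhausted.

    primary, alternate: unadministered items of the two forms in the shadow test.
    rejected: set that persists for the whole exam and is updated in place.
    """
    def ranked(form):
        usable = [i for i in form if i not in rejected]
        return sorted(usable, key=lambda i: info[i], reverse=True)

    # Best of primary, best of alternate, second best of primary, and so on.
    order = [i for pair in zip_longest(ranked(primary), ranked(alternate))
             for i in pair if i is not None]
    for item in order:
        if exposure_param[item] >= rng.random():
            return item
        rejected.add(item)
    return None


if __name__ == "__main__":
    info = {1: 0.9, 2: 0.5, 3: 0.8, 4: 0.4}
    exposure = {1: 0.1, 2: 1.0, 3: 0.2, 4: 1.0}  # hypothetical parameters
    rejected = set()
    print(select_with_alternate_form([1, 2], [3, 4], info, exposure,
                                     rejected, random.Random(3)), rejected)
```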


As mentioned earlier, the solution time of the shadow test is critical. To speed up
solution times, this formulation relaxes Kunde’s ILP such that only the $x_{if}$ value for the
current item needs to be binary, while the rest of the $x_{if}$ values can be continuous.
Allowing continuous variables could decrease overall solution quality, but we did not
observe any substantial differences. For the relaxation, the formulation splits $x_{if}$ into a
binary and continuous component, $xb_{if}$ and $xc_{if}$, respectively. Therefore the constraint
set from the original formulation:

$$x_{if}\ \text{binary} \qquad \forall i, f$$

is replaced with the below constraint sets.

$$
\begin{aligned}
0 \le x_{if} &\le 1 && \forall i, f \\
0 \le xc_{if} &\le 1 && \forall i, f \\
xb_{if}\ &\text{binary} && \forall i, f \\
x_{if} &= xb_{if} + xc_{if} && \forall i, f
\end{aligned}
$$

To specify that at least one $x_{if}$, other than the administered items, is an integer, the
following constraint is added.

$$\sum_{i} xb_{if} \ge \sum_{i \in \mathit{Fix}} x_{if} + 1 \qquad \forall f$$

Kunde’s formulation, along with the addition of the above constraints, establishes 
the base model for this thesis (KM). For this thesis, we develop three other variations for 
comparison. One variation (DM) comes from the observation that items administered 
with a higher deviation between the b parameter and current ability estimate tend to have 
a smaller effect on the ability estimate. For example, if an individual answered an item 
correctly in which the difficulty parameter was far below his current ability, it would 
barely affect the new ability estimate. Therefore, for this variation, the two constraints 
below are added to constrain the difficulty parameter to be within a given number, BLIM, 


of the current ability estimate. 


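$$
\begin{aligned}
(b_i - \theta_{k-1})\,x_{if} &\le \mathit{BLIM} && \forall i \notin \mathit{Fix},\ f \\
(b_i - \theta_{k-1})\,x_{if} &\ge -\mathit{BLIM} && \forall i \notin \mathit{Fix},\ f
\end{aligned}
$$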
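For an item whose decision variable equals one, the two constraints above reduce to the
window |b - θ| ≤ BLIM; a fractional value of the variable loosens the window. The sketch
below shows only the binary case as a simple filter, with hypothetical names and data.

```python
def within_difficulty_window(b, theta, blim, administered=()):
    """Unadministered items whose difficulty is within blim of theta.

    b maps item_id -> difficulty parameter.
    """
    return [i for i, b_i in b.items()
            if i not in administered and abs(b_i - theta) <= blim]


if __name__ == "__main__":
    b = {1: -1.0, 2: 0.2, 3: 0.9, 4: 0.4}  # hypothetical difficulties
    print(within_difficulty_window(b, theta=0.3, blim=0.5, administered={4}))
```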


Using the same notation as Kunde’s formulation and van der Linden’s sample 


shadow test, below is the formulation for this variation. 


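Data:
$BLIM$    Maximum deviation of item difficulty from current ability
Variables:
$x_{if}$    One, if item $i$ is used in form $f$
$xc_{if}$    Continuous component of $x_{if}$
$xb_{if}$    Binary component of $x_{if}$

Formulation:

\begin{align}
\min\ & \sum_{\theta}\sum_{f}\sum_{g} \mathit{PENALTY}_g\,(py_{g\theta f} + ny_{g\theta f}) + \mathit{PARAWEI}\sum_{f>1}(\mathit{delplus}_f + \mathit{delneg}_f) \tag{d1} \\
\text{s.t.}\ & \sum_{g} py_{g\theta f} \ge \sum_{i} \mathit{INF}_{i\theta}\,x_{if} - \mathit{SHAPE}_\theta && \forall \theta, f \tag{d2} \\
& \sum_{g} ny_{g\theta f} \ge -\sum_{i} \mathit{INF}_{i\theta}\,x_{if} + \mathit{SHAPE}_\theta && \forall \theta, f \tag{d3} \\
& \sum_{i \in \mathit{TaxItems}_t} x_{if} = \mathit{NITEM}_t && \forall f, t \tag{d4} \\
& \sum_{f} x_{if} \le 1 && \forall i \tag{d5} \\
& (b_i - \theta_{k-1})\,x_{if} \le \mathit{BLIM} && \forall i \notin \mathit{Fix},\ f \tag{d6} \\
& (b_i - \theta_{k-1})\,x_{if} \ge -\mathit{BLIM} && \forall i \notin \mathit{Fix},\ f \tag{d7} \\
& \sum_{i}\sum_{\theta} \mathit{INF}_{i\theta}\,x_{i1} - \sum_{i}\sum_{\theta} \mathit{INF}_{i\theta}\,x_{if} = \mathit{delplus}_f - \mathit{delneg}_f && \forall f > 1 \tag{d8} \\
& 0 \le py_{g\theta f} \le \mathit{CAT}_g && \forall \theta, f, g \tag{d9} \\
& 0 \le ny_{g\theta f} \le \mathit{CAT}_g && \forall \theta, f, g \tag{d10} \\
& x_{if} = 1 && \forall i \in \mathit{Fix},\ f \tag{d11} \\
& x_{if} = xb_{if} + xc_{if} && \forall i, f \tag{d12} \\
& \sum_{i} xb_{if} \ge \sum_{i \in \mathit{Fix}} x_{if} + 1 && \forall f \tag{d13} \\
& 0 \le x_{if} \le 1 && \forall i, f \tag{d14} \\
& 0 \le xc_{if} \le 1 && \forall i, f \tag{d15} \\
& xb_{if}\ \text{binary} && \forall i, f \tag{d16} \\
& \mathit{delplus}_f,\ \mathit{delneg}_f \ge 0 && \forall f \tag{d17}
\end{align}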


The second variation (SM) uses item stratification. It adds the below constraint,
adapted from Chang and van der Linden’s shadow test formulation with item
stratification, to the formulation.

$$\sum_{i \in Q_r} x_{if} = S_r \qquad \forall r, f$$

In order to ensure that the decision variable for an item from the current stage is binary,
the formulation sets all of the items in the shadow test at the current stage as binary. The
below constraint achieves this purpose.

$$\sum_{i \in Q_r} xb_{if} = S_r \qquad \forall r = \mathit{CURSTG},\ f$$

where CURSTG is the current stage of the exam.
The third variation (SDM) combines the DM and SM formulations. However,
instead of adding the two constraints to limit the difficulty parameter, the formulation
relaxes the two constraints and inserts them into the objective function as a price for
deviating too far from the current ability estimate. The new objective function is
therefore

$$
\min\ \sum_{\theta}\sum_{f}\sum_{g} \mathit{PENALTY}_g\,(py_{g\theta f} + ny_{g\theta f}) + \mathit{PARAWEI}\sum_{f>1}(\mathit{delplus}_f + \mathit{delneg}_f) + \mathit{DIFPEN}\sum_{i}\sum_{f}(\mathit{pbdev}_{if} + \mathit{nbdev}_{if})
$$

where $pbdev_{if}$ and $nbdev_{if}$ are given below

$$
\begin{aligned}
(b_i - \theta_{k-1})\,x_{if} &\le \mathit{BLIM} + \mathit{pbdev}_{if} && \forall i \notin \mathit{Fix},\ f \\
(b_i - \theta_{k-1})\,x_{if} &\ge -\mathit{BLIM} - \mathit{nbdev}_{if} && \forall i \notin \mathit{Fix},\ f,
\end{aligned}
$$

and DIFPEN is the penalty per unit for more than BLIM units over or under the current
ability estimate. The difficulty constraints are not added directly to the formulation
because, combined with the item stratification constraints, they tend to make the problem
infeasible. Below is the SDM formulation.


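Data:
$CURSTG$    Current stage of exam
Variables:
$pbdev_{if}$    The additional positive deviation of item $i$’s difficulty parameter from the
current ability estimate greater than BLIM
$nbdev_{if}$    The additional negative deviation of item $i$’s difficulty parameter from the
current ability estimate less than BLIM

Formulation:

\begin{align}
\min\ & \sum_{\theta}\sum_{f}\sum_{g} \mathit{PENALTY}_g\,(py_{g\theta f} + ny_{g\theta f}) + \mathit{PARAWEI}\sum_{f>1}(\mathit{delplus}_f + \mathit{delneg}_f) \notag \\
& + \mathit{DIFPEN}\sum_{i}\sum_{f}(\mathit{pbdev}_{if} + \mathit{nbdev}_{if}) \tag{sd1} \\
\text{s.t.}\ & \sum_{g} py_{g\theta f} \ge \sum_{i} \mathit{INF}_{i\theta}\,x_{if} - \mathit{SHAPE}_\theta && \forall \theta, f \tag{sd2} \\
& \sum_{g} ny_{g\theta f} \ge -\sum_{i} \mathit{INF}_{i\theta}\,x_{if} + \mathit{SHAPE}_\theta && \forall \theta, f \tag{sd3} \\
& \sum_{i \in \mathit{TaxItems}_t} x_{if} = \mathit{NITEM}_t && \forall f, t \tag{sd4} \\
& \sum_{f} x_{if} \le 1 && \forall i \tag{sd5} \\
& (b_i - \theta_{k-1})\,x_{if} \le \mathit{BLIM} + \mathit{pbdev}_{if} && \forall i \notin \mathit{Fix},\ f \tag{sd6} \\
& (b_i - \theta_{k-1})\,x_{if} \ge -\mathit{BLIM} - \mathit{nbdev}_{if} && \forall i \notin \mathit{Fix},\ f \tag{sd7} \\
& \sum_{i}\sum_{\theta} \mathit{INF}_{i\theta}\,x_{i1} - \sum_{i}\sum_{\theta} \mathit{INF}_{i\theta}\,x_{if} = \mathit{delplus}_f - \mathit{delneg}_f && \forall f > 1 \tag{sd8} \\
& 0 \le py_{g\theta f} \le \mathit{CAT}_g && \forall \theta, f, g \tag{sd9} \\
& 0 \le ny_{g\theta f} \le \mathit{CAT}_g && \forall \theta, f, g \tag{sd10} \\
& x_{if} = 1 && \forall i \in \mathit{Fix},\ f \tag{sd11} \\
& x_{if} = xb_{if} + xc_{if} && \forall i, f \tag{sd12} \\
& \sum_{i \in Q_r} xb_{if} = S_r && \forall r = \mathit{CURSTG},\ f \tag{sd13} \\
& \sum_{i \in Q_r} x_{if} = S_r && \forall r, f \tag{sd14} \\
& 0 \le x_{if} \le 1 && \forall i, f \tag{sd15} \\
& 0 \le xc_{if} \le 1 && \forall i, f \tag{sd16} \\
& xb_{if}\ \text{binary} && \forall i, f \tag{sd17} \\
& \mathit{delplus}_f,\ \mathit{delneg}_f \ge 0 && \forall f \tag{sd18}
\end{align}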
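The soft difficulty window in (sd6) and (sd7) can be illustrated for a single item. The
sketch assumes that pbdev and nbdev are nonnegative (the formulation above does not list
this bound explicitly) and that the minimization drives each to its smallest feasible value;
the function name and numbers are hypothetical.

```python
def difficulty_penalty(b_i, theta, blim, difpen, x=1.0):
    """Return (pbdev, nbdev, objective contribution) for one item.

    x is the value of the item's decision variable (1.0 for a selected item).
    Assumes pbdev, nbdev >= 0, which the formulation does not state explicitly.
    """
    gap = (b_i - theta) * x
    pbdev = max(0.0, gap - blim)    # smallest value satisfying (sd6)
    nbdev = max(0.0, -gap - blim)   # smallest value satisfying (sd7)
    return pbdev, nbdev, difpen * (pbdev + nbdev)


if __name__ == "__main__":
    for b_i in (-1.0, 0.8, 2.0):  # hypothetical difficulties
        print(b_i, difficulty_penalty(b_i, theta=0.5, blim=1.0, difpen=10.0))
```

Items inside the window contribute nothing to the objective, so the penalty term only
discourages, and never forbids, an item that the stratification constraints require.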


B. ABILITY CALCULATION 

The Owens Bayes algorithm [Sands, Waters, and McBride 1999], which the
CAT-ASVAB normally uses to calculate the ability after an examinee answers each item,
assumes that if an examinee answers an item correctly, he receives a more difficult item
next, and if he answers incorrectly, he receives an easier item [Krass 2005]. Because
none of the shadow test variations above consistently follows this behavior, this thesis
uses a different algorithm to estimate the ability after an examinee answers each item.
This algorithm, developed by Dan Segall of DMDC, is, unlike the Owens Bayes algorithm,
independent of both the order in which the test administers the items and whether the
test administers an item of higher difficulty after a correct answer [Krass 2005]. It is
slower than the Owens Bayes algorithm, but its calculation time is still within
30 seconds, which is our criterion for an acceptable solution time for a CAT [Krass
2005].
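
The text does not give the details of Segall's algorithm, so the sketch below does not
reproduce it. Instead, it illustrates what order independence means using a generic
expected a posteriori (EAP) estimate on a grid. It assumes a three-parameter logistic
response model with scaling factor D = 1.7 and a standard normal prior; the item
parameters, grid bounds, and function names are hypothetical.

```python
import math

D = 1.7  # scaling factor


def p_correct(theta, a, b, c):
    """Assumed 3PL probability of a correct answer."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def eap_estimate(responses, lo=-4.0, hi=4.0, grid_step=0.05):
    """Posterior mean of theta on a grid, with a standard normal prior.

    responses: list of ((a, b, c), u) pairs, where u is 1 if correct, else 0.
    """
    n = int(round((hi - lo) / grid_step)) + 1
    num = 0.0
    den = 0.0
    for j in range(n):
        theta = lo + j * grid_step
        log_post = -0.5 * theta * theta  # log N(0, 1) density up to a constant
        for (a, b, c), u in responses:
            p = p_correct(theta, a, b, c)
            log_post += math.log(p if u else 1.0 - p)
        weight = math.exp(log_post)
        num += theta * weight
        den += weight
    return num / den


if __name__ == "__main__":
    answers = [((1.0, 0.0, 0.2), 1), ((1.2, 0.5, 0.2), 0), ((0.8, -0.5, 0.2), 1)]
    # The likelihood is a product over items, so the order of answers does not matter.
    print(eap_estimate(answers), eap_estimate(answers[::-1]))
```

Because the estimate depends only on the set of item-response pairs, it remains valid
when a shadow test does not follow a correct answer with a harder item.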

IV. RESULTS OF CAT-ASVAB OPTIMIZATION SIMULATIONS


A. SETUP FOR SIMULATION 

To test the performance of the model, we run simulations for each shadow test
variation. GAMS [GAMS 2006] generates all integer linear programs (ILPs), and XA
[Sunset 2003] solves them on a 1.7 GHz Dell workstation. Following an approach similar
to that of Chang and van der Linden's paper on item stratification, we select a few
ability levels for the simulations: θ = -1.5, -1.0, -0.5, 0, 0.5, 1.0, and 1.5. For
each of these ability levels, the simulation creates 500 examinees. Each examinee takes a
test generated by each of five variations. The first is the current implementation of
the CAT-ASVAB, which administers items by maximum information (OM). This is the
benchmark for comparing the other four variations. The other four are shadow test
variations that come from a CAT-ASVAB optimization formulation: the variation derived
from Kunde's paper and pencil formulation adapted for the CAT (KM), the variation with
constraints on the difficulty parameters (DM), the variation using item stratification
(SM), and the variation with item stratification and difficulty parameter constraints
(SDM).
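
The simulation design described above can be sketched in a few lines of Python. This is
a minimal illustration, not the thesis code: it assumes a three-parameter logistic
response model with scaling factor D = 1.7, and the item parameters, seed, and function
names are hypothetical.

```python
import math
import random

D = 1.7  # scaling factor
ABILITY_LEVELS = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
EXAMINEES_PER_LEVEL = 500


def p_correct(theta, a, b, c):
    """Assumed 3PL probability that an examinee of ability theta answers correctly."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def simulate_response(theta, item, rng):
    """Draw a 0/1 response for one item given the examinee's true ability."""
    a, b, c = item
    return 1 if rng.random() < p_correct(theta, a, b, c) else 0


def make_examinees():
    """True abilities: 500 simulated examinees at each of the seven levels."""
    return [theta for theta in ABILITY_LEVELS for _ in range(EXAMINEES_PER_LEVEL)]


if __name__ == "__main__":
    rng = random.Random(2006)          # arbitrary seed for repeatability
    item = (1.0, 0.0, 0.2)             # hypothetical (a, b, c) parameters
    examinees = make_examinees()
    answers = [simulate_response(theta, item, rng) for theta in examinees]
    print(len(examinees), sum(answers))
```

Each simulated examinee would then take one test per variation, with the item selection
method supplying the next item after every response.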

Discretizing the ability levels provides information only for the selected values of θ,
but we have high confidence in the results for those ability levels. This discretization
also corresponds to an underlying assumption that examinee ability levels follow a
uniform distribution. An alternative strategy would be to sample from a continuous
distribution (for example, the standard normal). Previous CAT research has observed that
sampling from a continuous distribution of θ would require enormous sample sizes to get
reasonable estimates of the bias and mean squared error (MSE) functions, which still
would have to be pooled over classes of θ values and be accurate only near the center of
the distribution [Chang and van der Linden, 2003]. There are two consequences of
this assumption. “First, the results for the bias and MSE functions are conditional on θ
[Chang and van der Linden, 2003].” But, because the accuracy of these functions does not
depend on the distribution of the examinees, one can generalize the results for the bias
and MSE to any population of examinees. “Second, the results for the item exposure
rates do not necessarily generalize to other populations of examinees [Chang and van der
Linden, 2003].”
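
The bias and MSE functions discussed above are conditional on θ, which amounts to
grouping the estimation errors by true ability level before averaging. The sketch below
shows that calculation under the usual definitions (bias as the mean error and MSE as
the mean squared error at each level); the function name and sample data are
illustrative assumptions.

```python
from collections import defaultdict


def bias_and_mse(results):
    """Conditional bias and MSE at each true ability level.

    results: iterable of (true_theta, estimated_theta) pairs.
    Returns a dict mapping true_theta to (bias, mse).
    """
    errors = defaultdict(list)
    for true_theta, estimated_theta in results:
        errors[true_theta].append(estimated_theta - true_theta)
    summary = {}
    for true_theta, errs in errors.items():
        bias = sum(errs) / len(errs)
        mse = sum(e * e for e in errs) / len(errs)
        summary[true_theta] = (bias, mse)
    return summary


if __name__ == "__main__":
    # Hypothetical estimates for two ability levels.
    sample = [(0.0, 0.5), (0.0, -0.5), (1.0, 1.5), (1.0, 1.5)]
    print(bias_and_mse(sample))
```

Because each level is summarized separately, the result does not depend on how many
examinees a real population would have at each ability level.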

The item pool contains approximately 170 items and comes from the
Mathematical Knowledge test for the CAT-ASVAB [Sands, Waters, and McBride 1999].
These items are an experimental set, not an item pool currently in use for
the CAT-ASVAB. Each shadow test variation has about 2,500 constraints, 350 binary
variables, and 2,000 continuous variables.

The initial ability estimate for each test variation is θ = 0. After the simulated
examinees take the tests, the simulation outputs a set of deviations between the true and
estimated θ for each examinee. Then, using S-Plus 6.2 [Insightful 2003], we run a
Wilcoxon Sign-Rank Test to compare each shadow test variation's deviation distribution
to that of OM [e.g. Conover 1999]. Table 1 gives the parameters used for the shadow tests.





Parameter                                                                  Value

For all Shadow Test Variations
Forms per Shadow Test                                                      2
Number of Items per Form                                                   15
Scaling Factor (D)                                                         1.7
Number of items required from taxonomy group 1 ($NITEM_1$)                 2
Number of items required from taxonomy group 2 ($NITEM_2$)                 4
Number of items required from taxonomy group 3 ($NITEM_3$)                 8
Number of items required from taxonomy group 4 ($NITEM_4$)                 1

For DM and SDM
Maximum allowable deviation of difficulty from current ability (bLimit)    0.5

For SM and SDM
Number of items required from stratum 1 ($S_1$)                            3
Number of items required from stratum 2 ($S_2$)                            4
Number of items required from stratum 3 ($S_3$)                            4
Number of items required from stratum 4 ($S_4$)                            4

Repetitions (or number of examinees)                                       500

Table 1: Parameter Settings for Formulations
There are five variations altogether (the four shadow test variations and
OM) with 3,500 repetitions for each (500 repetitions for each of the seven given
ability levels).


B. RESULTS 

Table 2 shows the taxonomy distribution for the simulations. Altogether, the simulation
selects 52,500 items (15 items for each of the 3,500 tests) for each test
variation. OM performs poorly in terms of the taxonomy constraints specified: a
majority of the items administered in the OM simulation (40,150 of 52,500) come from
taxonomy group 3. This is most likely because 103 of the 170 items in the item pool are
in taxonomy group 3. The four shadow test variations, on the other hand, follow the
taxonomy constraints shown in Table 1.

Taxonomy Group      KM, DM, SM, and SDM      OM
1                   7000                     2858
2                   14000                    6028
3                   28000                    40150
4                   3500                     3464

Table 2: Taxonomy Distribution
The taxonomy distribution for OM heavily favors taxonomy group 3, while
the taxonomy distribution for KM, DM, SM, and SDM follows the
parameters set by the simulation (shown in Table 1).
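
The counts in Table 2 can be checked with simple arithmetic. The short sketch below
multiplies the per-test taxonomy requirements by the 3,500 tests and computes the share
of OM items in each group. The requirement values used here are those implied by the
shadow test column of Table 2 (each count divided by 3,500); the variable names are
illustrative.

```python
NUM_TESTS = 3500
ITEMS_PER_TEST = 15

# Items required per test from each taxonomy group (count in Table 2 / 3,500).
NITEM = {1: 2, 2: 4, 3: 8, 4: 1}

# Items administered by OM in each taxonomy group (Table 2).
OM_COUNTS = {1: 2858, 2: 6028, 3: 40150, 4: 3464}

# Expected totals under the taxonomy constraints.
expected = {group: NUM_TESTS * n for group, n in NITEM.items()}

# Fraction of all OM items that fall in each group.
om_total = sum(OM_COUNTS.values())
om_share = {group: count / om_total for group, count in OM_COUNTS.items()}

if __name__ == "__main__":
    print(expected)
    print(om_total, NUM_TESTS * ITEMS_PER_TEST)
    print({group: round(share, 3) for group, share in om_share.items()})
```

Both columns sum to 52,500 items, so the difference between them reflects only how the
items are distributed across taxonomy groups.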


Table 3 shows the solution times of each shadow test variation. The times include
the program generation, runtime, and output time for GAMS. KM and DM have
acceptable results, with maximum solution times under 10 seconds. The item
stratification variations, SM and SDM, however, have much higher maximum solution
times. The long solution times occurred primarily at the selection of the 12th item,
which is the beginning of the 4th and final stage. With the exception of that item,
solution times for the rest of the items in the test are as quick as those of the other
variations. If needed, the maximum solution times could possibly be reduced by using
direct problem generation or another solver, but we do not explore these options in
this thesis.

                           Solution Time (seconds)
Shadow Test Variation      Max          Min      Average
KM                         7.731        0.24     0.472
DM                         3.245        0.27     0.522
SM                         1036.66      0.27     1.865
SDM                        189.012      0.34     3.924

Table 3: Solution Times
The solution time for KM, DM, SM, and SDM, on average, is acceptable.
But the high maximum solution times for SM and SDM make them
impractical options.


Figure 2 shows the exposure rates of the items for each variation. An item's exposure
rate is the number of times the item is administered divided by the number of
tests. The x-axis lists the items in descending order according to their exposure rates.
Although SM and SDM start off much higher, all of the shadow test variations eventually
approach a more uniform distribution than OM. OM has the largest number of
unused items, at 77. SDM and SM have the next largest numbers of unused items, at
37 and 34 items, respectively. Of even more concern, however, are the extremely high
exposure rates: SM has a maximum exposure rate of 1, and SDM has a
maximum exposure rate of 0.86. The problem items, although different for each
variation, are concentrated at the start of the exam. A possible reason for this is that
items at the beginning of the test have lower discrimination, so their Sympson and Hetter
parameters are very high (close to or equal to 1). The Sympson and Hetter algorithm
would therefore rarely reject an item at the first stage. KM and DM administered all of
the items in their simulations. As the graph shows, the curves for KM and DM have the
flattest slopes, which indicate low maximum exposure rates and few unutilized items.
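
The exposure rate calculation behind Figure 2 is a count divided by the number of
tests. The sketch below computes the rates, sorts them in descending order as on the
figure's x-axis, and counts unused items. The function names and the toy data are
illustrative assumptions.

```python
from collections import Counter


def exposure_rates(administered_tests, pool_size):
    """Exposure rate of every item in the pool, sorted in descending order.

    administered_tests: list of tests, each a list of item indices in
    range(pool_size). Rate = times administered / number of tests.
    """
    counts = Counter(item for test in administered_tests for item in test)
    num_tests = len(administered_tests)
    rates = [counts.get(item, 0) / num_tests for item in range(pool_size)]
    return sorted(rates, reverse=True)


def count_unused(rates):
    """Number of items that were never administered."""
    return sum(1 for rate in rates if rate == 0)


if __name__ == "__main__":
    # Toy example: four two-item tests drawn from a pool of six items.
    tests = [[0, 1], [0, 2], [0, 1], [0, 3]]
    rates = exposure_rates(tests, pool_size=6)
    print(rates, max(rates), count_unused(rates))
```

A flat curve in this ordering means a low maximum rate and few zero entries, which is
the pattern the text attributes to KM and DM.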





[Figure 2 plot: Exposure Rates (y-axis) by Items (x-axis).]

Figure 2: Exposure Rates
OM is given by a solid line. KM is given by a thin dashed line. DM is
given by a bold dashed line, SM is given by a thin dotted line, and SDM is
given by a bold dotted line. The x-axis lists the items in descending order
according to their exposure rates.

Figures 3-7 below are the histograms of the errors for each test variation. The
error for each examinee's estimated ability is:

$$\hat{\theta}_k - \theta_k$$

where $\hat{\theta}_k$ is the estimated ability level of examinee k after the exam, and
$\theta_k$ is examinee k's true ability level. There are 3,500 examinees for each test
variation (500 examinees for each of the seven pre-selected ability levels). The Wilcoxon
Sign Rank test p-values are given in Table 4. For this simulation, we use a two-sided
test to determine whether the means and medians of each shadow test variation's
deviation distribution differ from those of OM. Using a 90% Confidence Interval, a
p-value under 0.05 would indicate a significant difference between the means and medians
of a given formulation and those of OM. The p-values for DM and SDM are equal to zero to
the precision reported. Therefore, DM and SDM differ significantly from OM.
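
The thesis runs this test in S-Plus. As a rough stand-in, the sketch below implements a
two-sided Wilcoxon signed-rank test in plain Python. It assumes that the deviations are
paired by examinee and uses the large-sample normal approximation without tie or
continuity corrections, so its p-values may differ slightly from those of a statistical
package. The function name and sample data are illustrative.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test for paired samples x and y.

    Zero differences are dropped and tied absolute differences share the
    average rank. Returns (W_plus, p_value) using a normal approximation.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda idx: abs(diffs[idx]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # mean of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return w_plus, p_value


if __name__ == "__main__":
    # Hypothetical paired deviations for a shadow test variation and OM.
    variation = [0.31, -0.12, 0.45, 0.08, -0.27, 0.19, 0.52, -0.04]
    om = [0.22, -0.30, 0.10, 0.15, -0.35, 0.05, 0.40, -0.20]
    print(wilcoxon_signed_rank(variation, om))
```

With 3,500 pairs per comparison, the normal approximation used here is a reasonable
choice, although an exact or corrected version would be preferable for small samples.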





p-values overall
KM          DM      SM          SDM
0.1417      0       0.2489      0

Table 4: p-values versus OM for Wilcoxon Sign Rank Test
DM and SDM significantly differ from OM because their p-values are
below 0.05.





[Figure 3 plot: OM Histogram of Errors; x-axis: error in θ, y-axis: Frequency.]

Figure 3: Error Histogram of OM
The x-axis gives the error range for θ (given by $\hat{\theta}_k - \theta_k$); the y-axis
gives the frequency of the errors.

[Figure 4 plot: KM Histogram of Errors.]

Figure 4: Error Histogram of KM
The x-axis gives the error range for θ (given by $\hat{\theta}_k - \theta_k$); the y-axis
gives the frequency of the errors.

Figure 5: Error Histogram of DM
The x-axis gives the error range for θ (given by $\hat{\theta}_k - \theta_k$); the y-axis
gives the frequency of the errors.

[Figure 6 plot: SM Histogram of Errors.]

Figure 6: Error Histogram of SM
The x-axis gives the error range for θ (given by $\hat{\theta}_k - \theta_k$); the y-axis
gives the frequency of the errors.

Figure 7: Error Histogram of SDM
The x-axis gives the error range for θ (given by $\hat{\theta}_k - \theta_k$); the y-axis
gives the frequency of the errors.


Figures 8 and 9 below show the bias and mean squared error (MSE) functions.
The values in the graphs are discrete, with polynomial interpolation (from MS Excel) used
to obtain the intermediate values. In terms of the bias functions, each test variation
performs similarly, with a large bias at the more extreme negative ability levels. The
graphs are also consistent with the results from the Wilcoxon Sign Rank Test. The KM and
SM curves have steep slopes like the OM curve at the extreme negative values of θ. OM
performs better than KM and SM for most of the curve, and performs better than all of
the shadow test variations at θ ≥ 0.5. This is not surprising, as there are no taxonomy
constraints on OM. The two variations shown to be significantly different from OM,
DM and SDM, have flatter curves and do not have the steep negative slope at the
extreme negative ability levels. Of particular note, DM performs better than OM for
most of the curve at θ < 0.5. Also, with the exception of θ = -0.5, where the magnitude
of its bias is only slightly higher than that of OM, SDM performs better than OM in the
same regions as DM.


Because the bias functions for the variations behave similarly, it is not surprising
that the MSE functions do as well, with large errors as θ approaches the
extreme negative values. OM performs the best for most of the curve (θ > 0), and
performs better than KM and SM for all values of θ. As with the bias curves, the MSE
curves for DM and SDM are flatter than OM's, so DM and SDM perform better at extreme
negative values of θ, with DM's MSE lower than SDM's MSE over the whole curve.

[Figure 8 plot: bias function; x-axis: Ability.]

Figure 8: Bias Function
OM is given by a solid line. KM is given by a thin dashed line. DM is
given by a bold dashed line, SM is given by a thin dotted line, and SDM is
given by a bold dotted line.

Figure 9: MSE Function
OM is given by a solid line. KM is given by a thin dashed line. DM is
given by a bold dashed line, SM is given by a thin dotted line, and SDM is
given by a bold dotted line.

V. CONCLUSIONS AND FUTURE RESEARCH


A. CONCLUSIONS 

The simulation results show that the current implementation of the CAT would
benefit from the use of shadow tests. The primary motivation for using shadow
tests in the CAT-ASVAB is to control taxonomy. This thesis introduces integer linear
program (ILP) formulations that achieve this objective, whereas our computational
experience shows that the current method of item selection for the CAT-ASVAB (OM)
has a taxonomy distribution that heavily favors one taxonomy group. In the area of item
exposure, there are also significant benefits over OM. Each shadow test variation leaves
fewer items unutilized. In the case of the first and second shadow test variations
(KM and DM), all items are administered, and maximum exposure rates are also lower
than OM's. The consequence of using the shadow test variations instead of OM is a slight
loss in precision. As stated in Chang and van der Linden’s paper, “the loss (in accuracy)
can be made up for by adding a few items to the test, whereas the loss in credibility for a
testing program due to item compromise or the financial loss involved in inefficient item
usage is much more difficult to compensate [Chang and van der Linden, 2003].”

Given the five metrics for the simulation (bias, mean squared error (MSE),
exposure rates, solution times, and taxonomy distribution), DM is the variation we
recommend most among the shadow test variations. Like the rest of the shadow test
variations, it meets the taxonomy constraints. Its maximum solution time is the lowest of
the four, and its average solution time is second only to KM's (Table 3). It actually has
a lower bias than OM for most of the curve. Finally, its mean squared error (MSE) is the
second lowest after OM's and is even lower than OM's at the negative values of θ. On the
other hand, because of their high maximum exposure rates and maximum solution times, the
shadow test variations with item stratification (SM and SDM) are not recommended,
despite also having bias and MSE close to OM's.


B. FUTURE RESEARCH 


Because an experimental set of items comprises the item pool for this thesis
simulation, further research can use an existing or future item pool to execute the
formulations. Also, this thesis uses data only for the Mathematical Knowledge (MK) test,
so future research can use item pools for the other CAT-ASVAB tests.
Another area that can be extended is the sampling of the examinees: one could use a
continuous distribution instead of sampling discrete values of θ. Finally, this thesis
uses only MSE and bias, whereas the current CAT-ASVAB uses the Birnbaum Score
Information Function to measure the precision of the exam [Sands, Waters, and McBride
1999]. Future research can therefore also use this function.